Over the last 20 years, a plethora of video games have been released in an ever growing market of gaming consoles. Given the diversity, I would like to determine if, by year and over the years, there is a particular console that is the most popular by use and by average game rating. In addition, I would like to determine if, by year and over the years within reason, there is a particular video game genre that is heralded as the favorite by rating and amount of games within that genre with a positive rating.

Loading the packages and the dataset

#For visualization
library(ggplot2)
library(plotly)

#For data manipulation
library(readr)
library(dplyr)
library(tidyr)
library(rmarkdown)
library(lubridate)
library(knitr)

#Word cloud
library(tidytext)
library(reshape2)
library(wordcloud)
IGN_data <- read.csv("ign_clean.csv")
summary(IGN_data)
##       X.1              X           score_phrase 
##  Min.   :    1   Min.   :    0   Great   :4773  
##  1st Qu.: 4657   1st Qu.: 4656   Good    :4741  
##  Median : 9313   Median : 9312   Okay    :2945  
##  Mean   : 9313   Mean   : 9312   Mediocre:1959  
##  3rd Qu.:13969   3rd Qu.:13968   Amazing :1804  
##  Max.   :18625   Max.   :18624   Bad     :1269  
##                                  (Other) :1134  
##                                      title      
##  Cars                                   :   10  
##  Madden NFL 07                          :   10  
##  Open Season                            :   10  
##  Brain Challenge                        :    9  
##  LEGO Star Wars II: The Original Trilogy:    9  
##  Madden NFL 08                          :    9  
##  (Other)                                :18568  
##                                             url       
##  /games/aladdin/gba-566703                    :    2  
##  /games/big-league-sports/wii-14275098        :    2  
##  /games/blur/xbox-360-14222096                :    2  
##  /games/call-of-duty-modern-warfare-2/ps3-2550:    2  
##  /games/crash-twinsanity/ps2-667247           :    2  
##  /games/defiance/pc-71832                     :    2  
##  (Other)                                      :18613  
##           platform        score             genre      editors_choice
##  PC           :3370   Min.   : 0.50   Action   :3797   N:15108       
##  PlayStation 2:1686   1st Qu.: 6.00   Sports   :1916   Y: 3517       
##  Xbox 360     :1631   Median : 7.30   Shooter  :1610                 
##  Wii          :1366   Mean   : 6.95   Racing   :1228                 
##  PlayStation 3:1356   3rd Qu.: 8.20   Adventure:1175                 
##  Nintendo DS  :1045   Max.   :10.00   Strategy :1071                 
##  (Other)      :8171                   (Other)  :7828                 
##   release_year  release_month     release_day        platform_group
##  Min.   :1996   Min.   : 1.000   Min.   : 1.00   PlayStation:5059  
##  1st Qu.:2003   1st Qu.: 4.000   1st Qu.: 8.00   Nintendo   :3906  
##  Median :2007   Median : 8.000   Median :16.00   Windows    :3386  
##  Mean   :2007   Mean   : 7.139   Mean   :15.61   Xbox       :2660  
##  3rd Qu.:2010   3rd Qu.:10.000   3rd Qu.:23.00   Apple      :1039  
##  Max.   :2016   Max.   :12.000   Max.   :31.00   Game Boy   :1001  
##                                                  (Other)    :1574  
##     genre_group  
##  Action   :5894  
##  Sports   :1988  
##  Shooter  :1614  
##  Adventure:1267  
##  Racing   :1258  
##  RPG      :1089  
##  (Other)  :5515

By looking at glimpse we already know that, but you can also check dimension to check number of rows and columns and column names with dim and names functions.

dim(IGN_data)
## [1] 18625    14
names(IGN_data)
##  [1] "X.1"            "X"              "score_phrase"   "title"         
##  [5] "url"            "platform"       "score"          "genre"         
##  [9] "editors_choice" "release_year"   "release_month"  "release_day"   
## [13] "platform_group" "genre_group"

I will finish the first exploration with that functions. We will go soon more deeply in particular column attributes. Now, after looking at the data I made a list of few comments to be checked and probably cleaned up.

Clean up tasks:

Clean up

Remove unuseful columns

IGN_data<- IGN_data[,-c(1,4)]

Unite date columns and change format

IGN_data<- IGN_data %>% 
  unite(date, release_day, release_month, release_year, sep="/") 

IGN_data$date<- as.Date(IGN_data$date,"%d/%m/%Y")
glimpse(IGN_data)
## Observations: 18,625
## Variables: 10
## $ X              <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ score_phrase   <fct> Amazing, Amazing, Great, Great, Great, Good, Aw...
## $ url            <fct> /games/littlebigplanet-vita/vita-98907, /games/...
## $ platform       <fct> PlayStation Vita, PlayStation Vita, iPad, Xbox ...
## $ score          <dbl> 9.0, 9.0, 8.5, 8.5, 8.5, 7.0, 3.0, 9.0, 3.0, 7....
## $ genre          <fct> Platformer, Platformer, Puzzle, Sports, Sports,...
## $ editors_choice <fct> Y, Y, N, N, N, N, N, Y, N, N, N, N, N, Y, Y, N,...
## $ date           <date> 2012-09-12, 2012-09-12, 2012-09-12, 2012-09-11...
## $ platform_group <fct> PlayStation, PlayStation, Apple, Xbox, PlayStat...
## $ genre_group    <fct> Platformer, Platformer, Puzzle, Sports, Sports,...

Transform score_phrase and editors_choice to factor

Score phrase

unique(IGN_data$score_phrase)
##  [1] Amazing     Great       Good        Awful       Okay       
##  [6] Mediocre    Bad         Painful     Unbearable  Disaster   
## [11] Masterpiece
## 11 Levels: Amazing Awful Bad Disaster Good Great Masterpiece ... Unbearable
IGN_data$score_phrase<- as.factor(IGN_data$score_phrase)

Editors_choice

unique(IGN_data$editors_choice)
## [1] Y N
## Levels: N Y
IGN_data$editors_choice<- as.factor(IGN_data$editors_choice)

Check

glimpse(IGN_data)
## Observations: 18,625
## Variables: 10
## $ X              <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ score_phrase   <fct> Amazing, Amazing, Great, Great, Great, Good, Aw...
## $ url            <fct> /games/littlebigplanet-vita/vita-98907, /games/...
## $ platform       <fct> PlayStation Vita, PlayStation Vita, iPad, Xbox ...
## $ score          <dbl> 9.0, 9.0, 8.5, 8.5, 8.5, 7.0, 3.0, 9.0, 3.0, 7....
## $ genre          <fct> Platformer, Platformer, Puzzle, Sports, Sports,...
## $ editors_choice <fct> Y, Y, N, N, N, N, N, Y, N, N, N, N, N, Y, Y, N,...
## $ date           <date> 2012-09-12, 2012-09-12, 2012-09-12, 2012-09-11...
## $ platform_group <fct> PlayStation, PlayStation, Apple, Xbox, PlayStat...
## $ genre_group    <fct> Platformer, Platformer, Puzzle, Sports, Sports,...

The dataset looks much cleaner now. When this step is ended it is time to go more deeply into the data. Although it is now seen what type of data we have, there is still no information about the columns content, especially games genre and relations between columns. Little visualisation would imagine some of them.

Explore the data

Score_phrase

We’ve already found out that it represent factor variables and also can name it. Let’s repeat their names and check how many games are in each score phrase. I will use plot_ly function. I am still learning all the opportunities of that huge package perfect for any visualisation.

#Score phrases:
levels(IGN_data$score_phrase)
##  [1] "Amazing"     "Awful"       "Bad"         "Disaster"    "Good"       
##  [6] "Great"       "Masterpiece" "Mediocre"    "Okay"        "Painful"    
## [11] "Unbearable"
IGN_data %>% 
  count(score_phrase) %>% 
  plot_ly(x = ~score_phrase, y = ~n) %>% 
  add_bars()

Doesn’t show much especially when score_phrase is not arrange correctly- so from most negative to most positive or vice versa

#I will arrange a bit the columns from positive to negative score phrase as I supposed they should be. 
order<- c("Disaster","Painful","Awful","Unbearable","Bad","Mediocre","Okay","Good","Great","Amazing","Masterpiece")

IGN_data$score_phrase<- ordered(IGN_data$score_phrase, levels = order)

IGN_data %>% 
  count(score_phrase) %>% 
  plot_ly(x = ~score_phrase, y = ~n) %>% 
  add_bars()

Tend is visible growing. Comparing well reviewed games to negatively reviewed one the proportion is big. Looking at the chart for the first time I saw big proportion of Masterpiece and Disaster games, but when looking for the second one I saw there is not so few negative score phrases, there are more than 2300 games.

Looking at the chart it is also truly seen than reviewers do not use extreme phrase scores very often either Masterpiece or Disaster. Let’s see which games have that honour to get so many maximum scores and Masterpiece score_phrase (Output of 30 records):

IGN_data<- arrange(IGN_data, desc(date)) 

masterpiece<- IGN_data[,-c(3,7)] %>% 
  filter(score_phrase == "Masterpiece" ) %>% 
  unique()
  
kable(masterpiece[1:30, ], caption= "Masterpiece score_phrase records")
Masterpiece score_phrase records
X score_phrase platform score genre date platform_group genre_group
18511 Masterpiece PlayStation 4 10 Adventure 2016-06-28 PlayStation Adventure
18623 Masterpiece Xbox One 10 Adventure 2016-06-28 Xbox Adventure
18624 Masterpiece PC 10 Adventure 2016-06-28 Windows Adventure
18352 Masterpiece PlayStation 4 10 Puzzle 2016-01-25 PlayStation Puzzle
18353 Masterpiece PC 10 Puzzle 2016-01-25 Windows Puzzle
18354 Masterpiece Xbox One 10 Puzzle 2016-01-25 Xbox Puzzle
18417 Masterpiece PC 10 RPG 2016-01-13 Windows RPG
18418 Masterpiece Macintosh 10 RPG 2016-01-13 Apple RPG
18432 Masterpiece Xbox One 10 Action 2015-08-24 Xbox Action
18433 Masterpiece PlayStation 4 10 Action 2015-08-24 PlayStation Action
18434 Masterpiece PC 10 Action 2015-08-24 Windows Action
18067 Masterpiece PC 10 Action, Adventure 2015-04-14 Windows Action
17998 Masterpiece PlayStation 4 10 Action, Adventure 2014-11-17 PlayStation Action
17999 Masterpiece Xbox One 10 Action, Adventure 2014-11-17 Xbox Action
17834 Masterpiece PlayStation 4 10 Action 2014-07-28 PlayStation Action
17163 Masterpiece PlayStation 3 10 Action, Adventure 2013-09-16 PlayStation Action
17164 Masterpiece Xbox 360 10 Action, Adventure 2013-09-16 Xbox Action
17128 Masterpiece PlayStation 3 10 Action 2013-06-05 PlayStation Action
16342 Masterpiece iPhone 10 Fighting, Action 2011-11-30 Apple Action
16350 Masterpiece Wii 10 Action, Adventure 2011-11-11 Nintendo Action
16279 Masterpiece PlayStation 3 10 Action, Adventure 2011-10-24 PlayStation Action
15757 Masterpiece Wii 10 Action, RPG 2011-05-25 Nintendo Action
15135 Masterpiece PlayStation 3 10 Action 2010-11-30 PlayStation Action
15237 Masterpiece Xbox 360 10 Action 2010-11-16 Xbox Action
15320 Masterpiece Xbox 360 10 Action, Adventure 2010-10-27 Xbox Action
15321 Masterpiece PlayStation 3 10 Action, Adventure 2010-10-27 PlayStation Action
14689 Masterpiece Wii 10 Platformer 2010-05-20 Nintendo Platformer
11121 Masterpiece PlayStation 3 10 Action 2008-06-13 PlayStation Action
11031 Masterpiece PlayStation 3 10 Action 2008-06-12 PlayStation Action
10836 Masterpiece Xbox 360 10 Action 2008-04-29 Xbox Action

Grand Theft Auto IV and The Last of us are in the right place. But where is my favourite and phenomenal The Witcher series. I feel little disappointed with the dataset. Also do not find anything that could be named masterpiece in Tormado Mania, Pokemon and Sonic games. It is a pity that the dataset does not contain how many votes particular game had, because based on that amount the mean score and score phrase are granted.

Let’s look at Disaster games

disaster<- IGN_data[,-c(3,7)] %>% 
  filter(score_phrase == "Disaster" ) %>% 
  unique()
  
kable(disaster[1:10, ], caption= "Disaster score_phrase records")
Disaster score_phrase records
X score_phrase platform score genre date platform_group genre_group
1 12513 Disaster Wii 0.8 Racing 2009-02-11 Nintendo Racing
2 5242 Disaster Wireless 0.5 Racing 2003-10-28 Other Racing
3 890 Disaster PC 0.7 Action 1998-10-29 Windows Action
NA NA NA NA NA NA NA NA NA
NA.1 NA NA NA NA NA NA NA NA
NA.2 NA NA NA NA NA NA NA NA
NA.3 NA NA NA NA NA NA NA NA
NA.4 NA NA NA NA NA NA NA NA
NA.5 NA NA NA NA NA NA NA NA
NA.6 NA NA NA NA NA NA NA NA

It seems that “Masterpiece” score_phrase was reserved for games with score equal to 10 and Disaster for score below 1. What interest me a lot is how many Masterpiece games was in years and how it was changing via time

IGN_data %>% 
  filter(score_phrase == "Masterpiece") %>% 
  mutate(year= year(date)) %>% 
  count(year) %>% 
  plot_ly(x = ~year, y = ~n) %>% 
  add_bars()

That’s surprising… although games can be done more quickly, sometimes without that putted heart in process of doing it they are getting better and more often we are keen to name them a Masterpiece.. or maybe we are just more positive and less demanding ?? Technology and how the graphic was improved can also be the purpose.

Lets do the same with the rest of the score_phrases

IGN_data_score_phrase<- IGN_data %>% 
  mutate(year= year(date)) %>% 
  count(year, score_phrase)


ggplot(IGN_data_score_phrase, aes(year, n)) +
geom_col(show.legend = FALSE) +
facet_wrap(~score_phrase, nrow=5, scales = "free")

Looks great. I see that there is something strange in year on Disaster chart but that is because of separating year from date, so it is no in date format any more. That’s why we can see that there are year’s halves on the plot. What first come to my mind is to transform it to factor so the year will become be a level but if you would try you will see that all the years will automatically appear on the axis and the chart wont be clear any more.

Plotting it in that form may not be so accurate as looking at it separately but on quick glance of it is seen that most of them have similar decreasing trend or do not have any trend.

One last chance to check score_phrase trend would be gathering negative one in one chart. As negative I understand Disaster, Painful, Awful, Unbearable, and Bad.

IGN_data %>% 
  filter(score_phrase == c("Disaster","Painful","Awful","Unbearable","Bad")) %>% 
  mutate(year= year(date)) %>% 
  count(year) %>% 
  plot_ly(x = ~year, y = ~n) %>% 
  add_bars()

Woohoo the trend of negative games is visibly strongly decreasing. Great news, moreover there more Masterpiece games.

Have you spotted something like golden year in games history between 2005 and 2010 ? Lets focus on that years and find the golden one (so with maximum count) for every phrase

a<- summarize(group_by(IGN_data_score_phrase, score_phrase), n= max(n))
total<- merge(a,IGN_data_score_phrase, by= c("score_phrase","n"))


ggplot(total, aes(score_phrase, fill= as.factor(year)))+ 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 40, hjust = 1))

This graph is confessing that there was a gold year and it was year 2008. Why? Go to date bookmark.

Score

After analysing the score_phrases we can supposed that they are somehow related. What I mean is that Disaster correspond to some average score. Let’s investigate what average score was connected to which score_phrase

score1<- IGN_data %>% 
  group_by(score_phrase) %>% 
  mutate( mean_score= mean(score)) %>% 
  count(score_phrase, mean_score)

kable(score1)
score_phrase mean_score n
Disaster 0.6666667 3
Painful 2.2679412 340
Awful 3.2902108 664
Unbearable 1.2902778 72
Bad 4.3315997 1269
Mediocre 5.3185299 1959
Okay 6.3665535 2945
Good 7.3690993 4741
Great 8.3165095 4773
Amazing 9.1766630 1804
Masterpiece 10.0000000 55

Hah, seems like my levels was not corrected arranged, as you can see Unbearable should be after Painful. I will quickly change it.

order<- c("Disaster","Unbearable","Painful","Awful","Bad","Mediocre","Okay","Good","Great","Amazing","Masterpiece")
IGN_data$score_phrase<- ordered(IGN_data$score_phrase, levels = order)

In advance we can see how many records are in different score_phrase and that rate is changing by about one point.

Platform

First let’s see unique platforms the dataset contains

unique(IGN_data$platform)
##  [1] PC                   Xbox One             PlayStation 4       
##  [4] Nintendo 3DS         Android              iPhone              
##  [7] PlayStation Vita     Wii U                PlayStation 3       
## [10] New Nintendo 3DS     Macintosh            Xbox 360            
## [13] iPad                 Linux                SteamOS             
## [16] Windows Phone        Ouya                 Wii                 
## [19] Windows Surface      PlayStation Portable Nintendo DS         
## [22] Arcade               Web Games            Nintendo DSi        
## [25] Wireless             PlayStation 2        Super NES           
## [28] NES                  Commodore 64/128     Master System       
## [31] Genesis              TurboGrafx-16        NeoGeo              
## [34] Sega 32X             TurboGrafx-CD        Saturn              
## [37] Atari 2600           Game Boy             Sega CD             
## [40] Vectrex              iPod                 Atari 5200          
## [43] Xbox                 GameCube             Game Boy Advance    
## [46] Dreamcast            N-Gage               Pocket PC           
## [49] PlayStation          Game Boy Color       DVD / HD Video Game 
## [52] Nintendo 64          WonderSwan Color     Lynx                
## [55] Nintendo 64DD        NeoGeo Pocket Color  WonderSwan          
## [58] Game.Com             Dreamcast VMU       
## 59 Levels: Android Arcade Atari 2600 Atari 5200 ... Xbox One

Wow, quite a lot of them. Now I would like to see the first and the last platform according to date

#Min
min<- IGN_data %>% 
  filter(date== min(date)) %>% 
  count(platform)
kable(min)
platform n
Saturn 1
#Max
max<- IGN_data %>% 
  filter(date== max(date)) %>% 
  count(platform)
kable(max)
platform n
PC 1
PlayStation 4 2
Xbox One 2

Something strange that Xbox is the first platform in the dataset. When we check:

min(IGN_data$date)
## [1] "1996-06-18"

clearly it is seen that it is the default one. It will for super affect on further charts if won’t be changed. If there was bit more information (Xbox360 is not so unique platform) we could estimate the year or even try to predict from built model. For now I decided to delete this row. First we need to know the row number (column is number 7)

look.for<- min(IGN_data$date)
IGN_data<- IGN_data[! IGN_data$date %in% look.for, ]

#Check
min(IGN_data$date)
## [1] "1996-09-25"
#Woohoo, now let's show proper platforms, the oldest one

min<- IGN_data %>% 
  filter(date== min(date)) %>% 
  count(platform)
kable(min)
platform n
Nintendo 64 2

Now I can correlate games score_phrase with platform (top 10)

top_10<- IGN_data %>%
  count(platform, sort=TRUE) %>% 
  head(10)


IGN_data %>% 
  filter(platform == top_10$platform) %>% 
  ggplot(aes(x= ordered(platform,top_10$platform), fill= score_phrase)) + geom_bar() + theme(axis.text.x = element_text(angle = 40, hjust = 1)) +labs(x= "Top 10 platforms")
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(platform, top_10$platform): longer object length is
## not a multiple of shorter object length

Genre

Ok, I suppose there a quite a lot of unique once, just quickly check this one:

vec<- unique(IGN_data$genre)
str(vec)
##  Factor w/ 113 levels "","Action","Action, Adventure",..: 12 83 95 77 88 3 2 93 70 106 ...
# Ohh 113 that is a lot, but how you probably saw the genre is duplicated when connected to other
columns<- c("genre", "score_phrase")
IGN_data_second<- IGN_data[,columns]

#I have an idea to deal with bigrams of genre, check this:
IGN_data_second<- IGN_data_second %>% 
  separate(genre, c("genre1", "genre2"), sep = ",")
#Genres are separated now, lets save second one with connected score_phrase, remove NA values and join to existing one. I will IGN_dataore that score value for one game would be doubled.

columns2<- c("genre2", "score_phrase")
IGN_data_third<- IGN_data_second[, columns2]
IGN_data_third<- IGN_data_third[complete.cases(IGN_data_third), ]

columns3<- c("genre1", "score_phrase")
IGN_data_second<- IGN_data_second[, columns3]
colnames(IGN_data_second)<- c("genre", "score_phrase")
colnames(IGN_data_third)<- c("genre", "score_phrase")

final_IGN_data<- rbind(IGN_data_second, IGN_data_third)
str(final_IGN_data)
## 'data.frame':    21240 obs. of  2 variables:
##  $ genre       : chr  "Adventure" "RPG" "Sports" "Sports" ...
##  $ score_phrase: Ord.factor w/ 11 levels "Disaster"<"Unbearable"<..: 9 7 9 9 9 10 9 7 9 9 ...

Time to summarise it with the chart: pay attention how action games run off other genres, also look at those great labels on the chart while zooming one score phrase.

#Unique genres now
vec<- unique(final_IGN_data$genre)
str(vec)
##  chr [1:55] "Adventure" "RPG" "Sports" "Racing" "Shooter" "Action" ...
#Select top 10 now
top_10_genre<- final_IGN_data %>% 
  count(genre, sort=TRUE) %>% 
  top_n(10)
## Selecting by n
#Chart
vector<- top_10_genre$genre
final_IGN_data %>% 
  filter(genre == vector) %>% 
  group_by(genre) %>% 
  count(genre, score_phrase) %>% 
  plot_ly(x = ~genre, y = ~n, type = 'bar', color = ~score_phrase) 

Date

Hey, we have seen that 2008 was a special year. Are you interested what was special that period? Maybe some new technology went on the market? It could be a good start so let’s select the year each technology was released and then filter selected year.

b<- summarize(group_by(IGN_data, platform), year= min(year(date)))
b %>% 
  filter(year == "2008")

As for my knowledge all of them except for iPhone that now is celebrating his 10th birthday, was released before 2000, some of them even before 1980. Is iPhone the purpose? It will be good to check if how many of games released in 2008 was for iPhone

#total games in 2008
games_2008<- IGN_data %>% mutate(year = year(date)) %>% filter( year == "2008") 
games_2008 %>% count(year)
games_2008 %>% filter(platform == "iPhone") %>% count(platform)

Not so big.. that obviusly wasn’t a purpose of producing more games.

Trend charts

Every year ratio of genres for each score_phrase is changing. Are you interested in what direction is it going? I will work on selected, let’s take four, score_phrases to prepare one chart gathering year- genre trend.

Let’s take: Amazing, Awful, Okay, Bad. There are quite a lot of them in the dataset so the line chart will be more smooth than for Mastepiece score_phrase

choosed_columns<- c("score_phrase", "genre", "date")

selected_data<- IGN_data[,choosed_columns] %>% 
  filter(score_phrase == c("Amazing", "Awful", "Okay", "Bad")) %>% 
  mutate(year= year(date))

selected_data<- selected_data[,-3]

Just like for the bigrams of genre lets repeat those steps

selected_data<- selected_data %>% 
  separate(genre, c("genre1", "genre2"), sep = ",")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1368 rows
## [3, 4, 5, 6, 8, 9, 10, 12, 13, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25,
## 27, ...].
columns_a<- c("score_phrase", "genre2", "year")
selected1<- selected_data[, columns_a]
selected1<- selected1[complete.cases(selected1), ]

columns_b<- c("score_phrase", "genre1", "year")
selected2<- selected_data[, columns_b]
colnames(selected1)<- c("score_phrase", "genre", "year")
colnames(selected2)<- c("score_phrase", "genre", "year")

final_masterdata<- rbind(selected2, selected1)
str(final_masterdata)
## 'data.frame':    1840 obs. of  3 variables:
##  $ score_phrase: Ord.factor w/ 11 levels "Disaster"<"Unbearable"<..: 7 10 10 4 7 7 7 10 7 10 ...
##  $ genre       : chr  "Action" "Action" "Adventure" "Shooter" ...
##  $ year        : num  2016 2016 2016 2016 2016 ...

I will create count table to sum up number of games in particular year. Than after checking count of games in different group in year (I mean score_phrase and genre) the ratio column can be calculated.

count<- final_masterdata %>% 
  group_by(year) %>% 
  count(score_phrase, year)

final_masterdata<- final_masterdata %>% 
  group_by(year) %>% 
  count(score_phrase, year, genre)

final_masterdata<- merge(final_masterdata, count, by=c("year", "score_phrase"))
final_masterdata<- final_masterdata %>% 
  mutate(ratio = n.x/n.y)

Final and my favourite part: let’s start plotting

#Action
final_masterdata %>% 
  filter(genre == "Action") %>% 
  ggplot(aes(x=year, y=ratio, group= score_phrase, colour=score_phrase)) + geom_line()

Doesn’t show much, right? Maybe separating it would bring more information

#Action
final_masterdata %>% 
  filter(genre == "Action") %>% 
  ggplot(aes(x=year, y=ratio)) + geom_line() +facet_grid(score_phrase ~.)

Quick play with word cloud

Hey, just to show you how fun and powerful wordcloud can be I will show you few examples. We have score phrases that are perfect for that. First, lets show according to letter measures, size of score_phrases groups

IGN_data %>% 
  count(score_phrase) %>% 
  with(wordcloud(score_phrase, n, max.words= 50))

I would like to check if that proportion will change for action genre, while this is the biggest group

IGN_data %>% 
  filter(genre == "Action") %>% 
  count(score_phrase) %>% 
  with(wordcloud(score_phrase, n, max.words= 50))